Discovering meaning from biological sequences: focus on predicting misannotated proteins, binding patterns, and G4-quadruplex secondary

Authors

  • Carson Michael Andorf
  • David Fernandez-Baca
  • Robert Jernigan
  • Taner Sen
  • Guang Song
Abstract

Proteins are the principal catalytic agents, structural elements, signal transmitters, transporters, and molecular machines in cells. Experimental determination of protein function is expensive in time and resources compared to computational methods. Hence, assigning protein function, predicting protein binding patterns, and understanding protein regulation are important problems in functional genomics and key challenges in bioinformatics. This dissertation comprises three studies. In the first two papers, we apply machine-learning methods to (1) identify misannotated sequences and (2) predict the binding patterns of proteins. The third paper is (3) a genome-wide analysis of G4-quadruplex sequences in the maize genome. The first two papers are based on two-stage classification methods. The first stage uses machine-learning approaches that combine composition-based and sequence-based features. We use either decision trees (HDTree) or support vector machines (SVM) as second-stage classifiers and show that classification performance matches or exceeds that of more computationally expensive approaches. For study (1), our method identified potentially misannotated sequences within a well-characterized set of proteins in a popular bioinformatics database. We identified misannotated proteins and show that these proteins have contradictory AmiGO and UniProt annotations. For study (2), we developed a three-phase approach: Phase I classifies whether a protein binds with another protein. Phase II determines whether a protein-binding protein is a hub. Phase III classifies hub proteins based on the number of binding sites and the number of concurrent binding partners. For study (3), we carried out a computational genome-wide screen to identify non-telomeric G4-quadruplex (G4Q)
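The three-phase approach described for study (2) can be read as a cascade in which each phase only sees the proteins that passed the previous one. The following is a minimal Python sketch of that control flow, not the dissertation's implementation: the function names, the output labels, and the three toy rules are all hypothetical stand-ins for the trained HDTree or SVM classifiers, and the amino-acid composition helper is only one simple example of a composition-based feature.

```python
from collections import Counter
from typing import Callable, Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq: str) -> Dict[str, float]:
    """Fraction of each standard amino acid in seq (a composition-based feature)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return {a: counts[a] / total for a in AMINO_ACIDS}


def cascade(
    seq: str,
    binds: Callable[[str], bool],      # Phase I: does the protein bind another protein?
    is_hub: Callable[[str], bool],     # Phase II: is a protein-binding protein a hub?
    hub_class: Callable[[str], str],   # Phase III: which kind of hub is it?
) -> str:
    """Run the three phases in order, stopping at the first negative answer."""
    if not binds(seq):
        return "non-binding"
    if not is_hub(seq):
        return "binding, non-hub"
    return "hub: " + hub_class(seq)


# Toy rules (assumptions for illustration only; real phases would be trained models).
def toy_binds(seq: str) -> bool:
    return composition(seq)["K"] > 0.1


def toy_is_hub(seq: str) -> bool:
    return len(seq) >= 10


def toy_hub_class(seq: str) -> str:
    return "class-A" if composition(seq)["G"] > 0.2 else "class-B"


if __name__ == "__main__":
    for s in ["AAAA", "KKAA", "KKGGGAAAAA", "KKAAAAAAAA"]:
        print(s, "->", cascade(s, toy_binds, toy_is_hub, toy_hub_class))
```

Passing the phases in as callables keeps the cascade independent of the model behind each phase, so a decision tree and an SVM could be swapped in without changing the control flow.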

Similar resources

Yeast Sub1 and human PC4 are G-quadruplex binding proteins that suppress genome instability at co-transcriptionally formed G4 DNA

G-quadruplex or G4 DNA is a non-B secondary DNA structure consisting of a stacked array of guanine-quartets that can disrupt critical cellular functions such as replication and transcription. When sequences that can adopt non-B structures, including G4 DNA, are located within actively transcribed genes, the reshaping of DNA topology necessary for the transcription process stimulates secondary structu...

Computational Analysis of G-Quadruplex Forming Sequences across Chromosomes Reveals High Density Patterns Near the Terminal Ends

G-quadruplex structures (G4) are found throughout the human genome and are known to play a regulatory role in a variety of molecular processes. Structurally, they have many configurations and can form from one or more DNA strands. At the gene level, they regulate gene expression and protein synthesis. In this paper, chromosomal-level patterns of distribution are analyzed on the human genome to ...

The conjugative DNA translocase TrwB is a structure-specific DNA-binding protein.

TrwB is a DNA-dependent ATPase involved in DNA transport during bacterial conjugation. The protein presents structural similarity to hexameric molecular motors such as F(1)-ATPase, FtsK, or ring helicases, suggesting that TrwB also operates as a motor, using energy released from ATP hydrolysis to pump single-stranded DNA through its central channel. In this work, we have carried out an extensiv...

Identification of RNA Oligonucleotides Binding to Several Proteins from Potential G-Quadruplex Forming Regions in Transcribed Pre-mRNA.

G-quadruplexes (G4s) are noncanonical DNA/RNA structures formed by guanine-rich sequences. Recently, G4s have been found not only in aptamers but also in the genomic DNA and transcribed RNA. In this study, we identified new RNA oligonucleotides working as aptamers by focusing on G4-forming RNAs located within the pre-mRNA. We showed that the G4 in the 5' UTR and first intron of VEGFA bound to t...

FANCJ promotes DNA synthesis through G-quadruplex structures

Our genome contains many G-rich sequences, which have the propensity to fold into stable secondary DNA structures called G4 or G-quadruplex structures. These structures have been implicated in cellular processes such as gene regulation and telomere maintenance. However, G4 sequences are prone to mutations particularly upon replication stress or in the absence of specific helicases. To investiga...

Publication date: 2015